Gene amplification occurs in most solid tumors and is associated with poor prognosis. Amplification of 20q13.2
is common to several tumor types including breast cancer. The 1 Mb of sequence spanning the 20q13.2 breast
cancer amplicon is one of the most exhaustively studied segments of the human genome. These studies have
included amplicon mapping by comparative genomic hybridization (CGH), fluorescent in-situ hybridization
(FISH), array-CGH, quantitative microsatellite analysis (QUMA), and functional genomic studies. Together these
studies revealed a complex amplicon structure suggesting the presence of at least two driver genes in some
tumors. One of these, ZNF217, is capable of immortalizing human mammary epithelial cells (HMEC) when
overexpressed. In addition, we now report the sequencing of this region in human and mouse, and
quantitative expression studies in tumors. Amplicon localization now is straightforward and the availability of
human and mouse genomic sequence facilitates functional analysis of amplicons. However, comprehensive annotation of
megabase-scale regions requires integration of vast amounts of information. We present a system for integrative
analysis and demonstrate its utility on 1.2 Mb of sequence spanning the 20q13.2 breast cancer amplicon and 865
kb of syntenic murine sequence. We integrate tumor genome copy number measurements with exhaustive
genome landscape mapping, showing that amplicon boundaries are associated with maxima in repetitive element
density and a region of evolutionary instability. This integration of comprehensive sequence annotation,
quantitative expression analysis, and tumor amplicon boundaries provides evidence for an additional driver gene,
prefoldin 4 (PFDN4), for coregulated genes, and for conserved noncoding regions, and associates repetitive elements with
regions of genomic instability at this locus.



Genome scanning techniques such as Comparative Ge- 
nomic Hybridization (CGH), Restriction Landmark Ge- 
nome Scanning, and analysis of Loss of Heterozygosity 
(LOH) have mapped numerous regions of recurrent ge- 
nome copy number abnormality in human solid tu- 
mors (Gray and Collins 2000). In breast tumors alone, 
>30 such regions have been identified (Kallioniemi et 
al. 1994) and the genomes of most other tumor types 
are similarly affected (Knuutila et al. 1998, 1999). Such 
aberrant loci are thought to encode proteins that par- 
ticipate in tumor progression as a result of altered gene 
dosage, translocations, and/or mutation. Typically, 
these "cancer genes" are identified by narrowly defin- 
ing regions of recurrent loss or gain followed by func- 
tional assessment of candidate genes. This approach is 
becoming increasingly efficient with the development 
of high-resolution genome scanning techniques such 

Corresponding author. 

E-MAIL collins@cc.ucsf.edu; FAX (415) 476-8218.

Article published on-line before print: Genome Res., 10.1101/gr.174301.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.174301.



as array CGH (Pinkel et al. 1998; Albertson et al. 2000). 
However, the mapping information from these tech- 
niques will be most informative only when integrated 
with well-annotated genomic sequence. To accomplish 
this, we have developed and applied a suite of software 
tools collectively called Genome Cryptographer (GC) 
to facilitate integrative analysis. GC collects genome 
sequence information from multiple databases and vi- 
sually displays it in analysis intervals (AIs) of constant 
width along the genome. Displayed information in- 
cludes CpG density, sequence tagged sites (STSs), ex- 
pressed sequence tag (EST) clusters, locations and den- 
sities of repeated sequences (e.g., Alus, SINEs, LINEs), 
duplicons, similarities with syntenic murine se-
quences, known genes, and genome copy number de-
termined using array CGH.

We applied GC to the analysis of 1.2 Mb of
20q13.2 because it is amplified in a wide range of tu-
mor types (Kallioniemi et al. 1994, 1998, 1999), ap-
pears to be an early event in breast cancer (Werner et
al. 1999), and is associated with aggressive tumor be-
havior (Tanner et al. 1995), immortalization (Savelieva
et al. 1997; Cuthill et al. 1999), and genome instability
(Savelieva et al. 1997). The entire region is amplified in
the majority of breast tumors with gain at 20q13.2
(Tanner et al. 1994, 1996). However, high-resolution 
fluorescent in-situ hybridization (FISH) (Collins et al. 
1998), quantitative microsatellite analysis (QUMA) 
(Ginzinger et al. 2000), and array CGH (Albertson et al. 
2000) mapping elucidated a complex amplicon struc-
ture with two regions of recurrent amplification sepa-
rated by ~600 kb (Albertson et al. 2000), one region
containing ZNF217 (Collins et al. 1998) and the other 
CYP24 (Albertson et al. 2000). Overexpression of 
ZNF217 immortalizes cultured human mammary epi- 
thelial cells (HMEC) (Nonet et al. 2001) and overex- 
pression of CYP24 has been postulated to interfere 
with vitamin D-mediated differentiation (Albertson et
al. 2000). Nevertheless, other genes in the amplicon 
peaks also may contribute to cancer progression. Ac- 
cordingly, we sequenced and computationally ana- 
lyzed the entire 1.2-Mb region to catalog all genes in 
the region and to attempt to identify structural features 
in the DNA sequence that might underlie local insta- 
bility. 

RESULTS AND DISCUSSION 

Figure 1 shows a GC analysis of a 1.2-Mb region of
amplification at 20q13.2. This analysis located six
previously identified genes (Collins et al. 1998) as well
as four genes (NABC3 [Novel gene Amplified in Breast
Cancer], NABC4, NABC5, and prefoldin 4 [PFDN4])
that previously were not known to be present in this
region (Fig. 1A). (Multiple gene prediction algorithms
were used to find genes; however, these analyses failed
to provide convincing evidence for additional coding
sequences, and thus the data were not included.) We
then manually integrated GC output and array-CGH 
data to map genes relative to amplicon peaks at ge- 
nome sequence resolution and to identify sequence 
features that might play a role in the amplification 
process (Fig. 1A). The array-CGH mapping was per- 
formed with a contiguous set of bacterial artificial 
chromosome (BAC) clones spanning this amplicon (Al- 
bertson et al. 2000). Boxes indicate the genomic inter- 
val for which copy number was measured, and color 
corresponds to copy number with crimson represent- 
ing highest copy number. The triangles point to am- 
plicon boundaries defined as clusters of amplification 
breakpoints previously identified in primary tumors 
and breast-cancer cell lines. 

The GC analysis suggests the possibility that re- 
petitive elements are involved in amplification at 
20q13.2. Figure 1 shows a markedly uneven distribu-
tion of the density and type of repetitive elements
across the region. Earlier FISH- and array CGH-based 
studies (Collins et al. 1998; Albertson et al. 2000) 
mapped amplicon boundaries with a high degree of 



precision and revealed two classes of tumors. In one 
class, the copy number maximum is centered on the 
ZNF217-NABC3 locus (Collins et al. 1998). In the sec- 
ond class, a larger amplicon includes both the ZNF217- 
NABC3 and CYP24-PFDN4 loci (Albertson et al. 2000) 
with the copy number peak centered on the CYP24- 
PFDN4 locus. In the first class of tumors, the proximal 
boundary was mapped by FISH in two tumors (Collins 
et al. 1998) and refined by Southern blot mapping in 
one (C. Collins, unpubl.) to within 10 kb of the 
ZNF217 gene's 3' untranslated region (UTR). The distal
boundary was mapped in three independent tumors 
and one cell line (Collins et al. 1998). In the second 
class, the boundary distal to CYP24-PFDN4 was 
mapped to within a single BAC in two tumors (Albert-
son et al. 2000). Interestingly, the average density of
repetitive elements flanking amplicon boundaries is
below 40%; however, each of the three amplicon
boundaries falls into a region of >60% repetitive DNA
content. Repetitive elements (e.g., Alu and L1) have
been implicated in recombination (Moran et al. 1999), 
genome evolution (Brosius 1999) and disease-related 
aberrations (Huie et al. 1999). Thus, the association of 
high repetitive element density with regions of fre- 
quent chromosome breakage suggests a possible role 
for repetitive elements in the amplification process 
(e.g., as sites for recombination-driven amplification). 
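
The comparison of repeat density with amplicon boundaries described above can be sketched in a few lines. The following is a minimal illustration, not the GC implementation: it assumes repeat annotations are available as half-open (start, end) intervals (for example, parsed from RepeatMasker output) and reports the fraction of each fixed-width analysis interval covered by repeats. The function name, the toy coordinates, and the use of 60% as a cutoff in the example are assumptions made for illustration only.

```python
def repeat_density(repeats, seq_len, window):
    """Fraction of each fixed-width window covered by repeat intervals.

    repeats: iterable of (start, end) half-open, 0-based intervals.
    Overlapping intervals are merged first so no base is counted twice.
    """
    merged = []
    for start, end in sorted(repeats):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    n_windows = (seq_len + window - 1) // window
    covered = [0] * n_windows
    for start, end in merged:
        end = min(end, seq_len)
        pos = start
        while pos < end:
            w = pos // window
            w_end = min((w + 1) * window, end)
            covered[w] += w_end - pos
            pos = w_end

    densities = []
    for w in range(n_windows):
        # The last window may be shorter than the others.
        w_len = min(window, seq_len - w * window)
        densities.append(covered[w] / w_len)
    return densities


# Toy example with hypothetical coordinates (not data from this study).
toy_repeats = [(0, 300), (250, 400), (1000, 1700), (1900, 2000)]
density = repeat_density(toy_repeats, seq_len=2000, window=1000)
# Indices of windows whose repeat content exceeds 60%.
rich = [i for i, d in enumerate(density) if d > 0.60]
```

In such a scheme, mapped boundary positions would be compared against the windows flagged as repeat-rich; the choice of window width affects how sharply local maxima appear.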

GC analysis also revealed a 14-kb duplicon (Ei-
chler 1998) 167 bp distal to ZNF217. This is significant
because duplicons have been associated with evolu- 
tionarily unstable chromosomal loci in primates. Ho- 
mologous recombination between duplicons has been 
implicated in the formation of duplications, deletions, 
inversions, translocations, and formation of supernu- 
merary marker chromosomes (Ji et al. 2000), some of 
which are disease-related (Eichler 1998; Christian et al. 
1999; Peoples et al. 2000). Thus, this element may play 
a role in amplification of the ZNF217-NABC3 locus in 
cancer. The duplicon includes NABC3 and a CpG is-
land and is ~97% identical to elements found on the
long arms of chromosomes 15q and 22q (Fig. 2). Hy- 
bridization of probes spanning the duplicon to the 
CalTech D BAC library resulted in identification of 16 
BAC clones. These were FISH-mapped to chromosomes 
4p, 12q, 15q, 21q (Fig. 2), 20q, and 22q. In addition, 
some of the BAC clones decorated the pericentromeric 
regions of multiple chromosomes (data not shown). 
Although we do not know if each mapped BAC con- 
tains a complete element, we do know from GC analy-
sis that chromosome arms 20q, 22q, and 15q do in fact
have complete elements, and that chromosomes 10, 
21, and 13 harbor fragments of the duplicon. 

The degree of sequence conservation and pattern 
of chromosomal distribution provide compelling evi-
dence that this element is indeed a duplicon (Eichler
1998). A retroviral LTR inserted in the chromosome 22
element disrupts the paralogous NABC3 gene (Fig. 2).
Comparative analysis of human and syntenic mouse 
sequence identified an orthologous NABC3 gene at 
mouse chromosome 2H3 (syntenic to human chromo-
some 20q13.2). In addition, the position, size, and pre-
sumably function of the 1.8-kb CpG island are also con-
served (Fig. 3A). FISH mapping indicates that in mouse
the NABC3 gene is single copy (data not shown). This 
finding is consistent with the current view that dupli- 
cons do not occur outside of primates (Eichler et al. 
1999). Thus, the duplicon's pangenomic migration most
likely occurred after the primate-mouse divergence,
with 20q13.2 being the ancestral element. The finding
of a duplicon within the 20q13.2 amplicon is intriguing;
however, in the absence of data regarding the presence
of duplicons in other amplicons, its role in mediating
amplification remains unclear.

The 1.2-Mb region contains 10 genes, three CpG islands,



Figure 1 Integration of genome copy number and genome
sequence information in a region of amplification at 20q13.2. (A)
Genome Cryptographer (GC) analysis of a 1.2-Mb region of 
amplification. Average genome copy number values in se- 
lected tumors (S50, S59, S21) measured using array Compara- 
tive Genomic Hybridization (CGH) (Albertson et al. 2000) are 
shown as color-coded bars at the top of the figure. The array 
CGH data were obtained using a contig of BAC clones that 
now have been sequenced. Brick red lines represent public 
draft assemblies as of 2.1.01. Pink lines correspond to the 
exact size and position of the BAC clones used in the study. 
Densities and classification of repetitive elements are shown 
in color-coded cumulative bar chart above the X axis. CpG 
dinucleotide densities are plotted below the X axis as open 
green boxes. Sequence features such as genes are shown as 
horizontal lines above the X axis spanning the total extent of 
the sequence similarity. Exons are shown in bold lines. Genes 
and pseudogenes are represented by blue arrows pointing in 
the direction of transcription. The names of genes appear be- 
low the CGH copy number plot in black bold font. Total num- 
ber of gene/EST hits and/or mouse identity regions are pre- 
sented below the X axis as red or blue circles, respectively. 
Aquamarine triangles with bars, indicating the mapping reso- 
lution, mark the approximate positions of amplicon bound- 
aries mapped by array CGH (Albertson et al. 2000), fluores- 
cent in-situ hybridization (FISH) (Collins et al. 1998) and 
Southern hybridization (Collins et al., unpubl.). This figure
can also be viewed at http://shark.ucsf.edu:8080/~stas/GR2001/index.html.
(B) Enlargement of the ZNF217-NABC3
region of 20q13.2 amplification. This panel further illustrates
the ability of GC to annotate features such as public draft 
sequence assembly (orange), BAC template locations (pink), 
STSs (dark green), alignment of syntenic murine sequence 
(light blue line), human/murine sequence identities (light 
blue rectangle on line), human genes (dark blue), duplications 
and other identities to human genomic sequence (black). The 
locations of genome duplications (e.g., Chr15_AC015713) are
identified above the black line indicating the chromosome 20 
location of each duplicon. Ratios shown beneath EST clusters 
correspond to the total number of EST hits/total murine EST 
hits. Numbers under blue circles indicate the total number of 
murine sequence identities per analysis interval. (C) ZNF217- 
EGFP fusion proteins localize to the nucleus of HeLa cells and 
are excluded from the nucleoli. The top two panels show lo- 
calization of ZNF217-GFP fusion and the bottom two panels 
show DAPI staining of cell nuclei. 



and two pseudogenes. The ZNF217, NABC3, CYP24, 
and PFDN4 genes are of particular interest because they 
are located at amplification maxima. ZNF217 has been
shown to immortalize HMEC and thus has properties
consistent with it being a bona fide oncogene. Struc-
turally, ZNF217 resembles a transcription factor having
eight C2H2 motifs, a nuclear localization signal, and a 
proline-rich domain (Collins et al. 1998). The NABC3 
cDNA has a poly-A tail, lacks an open reading frame, 
does not share identity with any known genes, lacks 
introns, and is expressed in a wide range of tissues 
(data not shown). Analysis of the predicted RNA sec-
ondary structure using MFOLD
(http://bioweb.pasteur.fr/seqanal/interfaces/mfold.html) shows that it is unusu-
ally stable. These features suggest that NABC3 may be
an RNA gene rather than a processed pseudogene.
Its possible role in cancer remains unclear. PFDN4 is a 
subunit of the heterohexameric chaperone protein pre-
foldin (Vainberg et al. 1998). It captures un-
folded actin and tubulin for delivery to the cytosolic chap-
eronin (CCT) (Vainberg et al. 1998; Hansen et al. 1999).
PFDN4 may function as a transcription factor or cofac- 
tor in cell-cycle regulation (Iijima et al. 1996). 

Expression levels of ZNF217, NABC3, and PFDN4 
were analyzed in normal cultured human breast epi- 
thelial cells, breast-cancer cell lines, and primary tu- 
mors (Fig. 4) using quantitative reverse transcriptase- 
polymerase chain reaction (RT-PCR). Expression of 
NABC3 was strikingly similar to that of ZNF217, in- 
cluding high-level expression in the cell lines 600MPE 
and T47D in which they are not amplified. The coor- 
dinate expression of ZNF217 and NABC3 suggests uti- 
lization of common regulatory elements. To identify 
putative regulatory elements, we aligned syntenic 
mouse sequence spanning the ZNF217-NABC3 locus 
(Fig. 3). This alignment and a percent identity plot 
(PIP) analysis (http://bio.cse.psu.edu/pipmaker/) iden-
tified several conserved noncoding elements in and
around the region encoding ZNF217-NABC3. In Figure 
3A, these regions of conserved noncoding DNA in and 
flanking ZNF217 are circled. A cluster of such motifs 
occurs in and proximal to the 3' untranslated region, 
in the first intron, and distal to the first exon. An ex- 
ample of an actual sequence alignment for one of the 
elements circled in red is shown in Figure 3B. These 
candidate regulatory elements now can be assessed for 
activating mutations and epigenetic modifications in 
600MPE, T47D, and primary breast tumors in which 
ZNF217 and NABC3 are overexpressed in the absence 
of amplification. PFDN4 was overexpressed in cell lines 
in which it was amplified. Thus, both NABC3 and 
PFDN4 remain viable candidate oncogenes requiring 
further biological assessment. It will be important to 
determine if a synergistic relationship exists between 
these genes and ZNF217. 
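
The human/mouse comparison used above to flag conserved noncoding elements rests on percent identity over aligned positions. The sketch below shows one way to compute it for a pair of already aligned sequences; it is an illustration, not PipMaker's algorithm. The function name is hypothetical, and excluding gap columns from the denominator is one convention among several. The example uses the final row of the alignment shown in Figure 3B.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns containing a gap character in either sequence are skipped,
    so the denominator is the number of ungapped aligned columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    matches = 0
    for a, b in zip(seq_a.lower(), seq_b.lower()):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


# Final row of the human/mouse alignment in Figure 3B.
human_tail = "ggtaacttggg"
mouse_tail = "gacaacttggg"
pid = percent_identity(human_tail, mouse_tail)  # 9 of 11 columns match
```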

Next we extended this functional genomic analy- 



Genome Research 1037 

www.genome.org 



Collins et al.






Sequence Analysis of a Cancer Amplicon 










































[Figure graphic not recoverable; surviving labels: Chr 15, Chr 22, CpG island, Exons.]











Human: ccatcctcagatccgtcttcagaaccaccttcccctcgatccacggctccattttcatcc
       ||||||||||| ||||||||||| || |||||   ||| ||| |||| ||| | || |||
Mouse: ccatcctcagacccgtcttcagagcccccttc---tcggtccccggccccactgtcttcc

H: agaggggcggcgaggtcaggagaacacgtccccggctgcctcccgtccacagacatggtg
       |||| || |  |||||| || || || ||||| ||||||||  | ||||||| |||||||
Mouse: agagtggtgctgaggtctggggagcaggtcccaggctgccttgcatccacagccatggtg

H: ggcgactccgcgccggccctccggtccttcttgtggaccctggagtgcaagaccagctgg
       || |||   ||  | | ||||| ||||||| |||| ||||| ||||||| ||| ||||||
Mouse: ggtgacagggcatcagtcctcctgtccttcctgtgcaccctcgagtgcaggacgagctgg

H: tggtaggttctgaaagctttgccgcactcggagcagtgagtgggcttctccttgctactg
       ||||| || ||||| ||||||| |||||| || ||||| |||||||||||||||||||||
Mouse: tggtatgtcctgaaggctttgctgcactcagaacagtgcgtgggcttctccttgctactg

H: ggtaacttggg
       |  ||||||||
Mouse: gacaacttggg

Figure 3 (A) A high-resolution Genome Cryptographer (GC) analysis showing human/mouse sequence alignment. GC analysis was
carried out with an analysis interval of 1 kb. This figure shows a chromosome 20 PAC (AL157838) in black. The extent of syntenic mouse
sequence is indicated by a thin blue line with sequence identities shown as heavy lines. Human genes ZNF217 and NABC3 appear as dark
blue arrows pointing in the direction of transcription. Bracketed lines show interchromosomal duplications. Their extent is shown as thin
black lines with actual sequence identities indicated by heavy black lines (e.g., Chr15, AC015713). (B) Sequence alignment of noncoding
conserved human and mouse sequence (circled in red on the GC analysis in A).



sis to the protein level. As one of the first steps of the 
systematic functional annotation of all proteins iden- 
tified in the amplicon, we sought to determine their 
subcellular localization. Because ZNF217 maps to a 
narrow tumor amplicon, is overexpressed in all tumors 
in which it is amplified and some in which it is not, 
and can immortalize HMECs upon ectopic expression, 
we sought to determine its subcellular localization 
first. To this end we constructed a vector expressing a 
ZNF217-green fluorescent protein (GFP) fusion and 
microinjected this construct into HeLa cells. As shown 
in Figure 1C, the ZNF217-GFP fusion localizes to the 
nucleus in a punctate pattern. These data are consis- 
tent with the presence of nuclear localization signals in 
ZNF217 identified by PSORT (http://psort.nibb.ac.jp/).
Of the two genes mapped to the distal amplicon peak,
CYP24 has been localized to the mitochondria in pre- 
vious studies (Beckman and DeLuca 1997) and a manu- 



script is in preparation with detailed analysis of PFDN4 
including its subcellular localization. 

Finally, we note that the 1.2 Mb of assembled se- 
quence reported here is consistent with the NCBI draft 
assembly (NT_011484 and NT_019675) and bridges the 
gap between these two sequence contigs (Fig. 1A). 
However, GC analysis did reveal a number of annota- 
tion errors and sequencing artifacts present in the pub- 
lic database. We found 11 regions of putative sequence 
identity between this sequence and chromosome 5 av-
eraging ~300 bp. Exhaustive characterization of these
regions including RH mapping, and PCR on individual 
BAC clones from chromosomes 5 and 20, showed con- 
clusively that the identities are, in fact, chromosome 
20 sequences contaminating that of chromosome 5 
BACs. In addition, we identified two BACs (AC026267, 
AC021970) annotated as chromosomes 4 and 16, respectively.
These BACs share >99% sequence identity with the 20q13.2
sequence over their entire length and contain only chromosome 20 STSs.
Thus, we conclude that these represent annotation errors.
The graphical representation of the 1.2-Mb sequence
immediately revealed the presence and extent
of both types of artifacts and facilitated the design of
experiments to distinguish artifacts from real paralogous
sequences.

Figure 4 RNA expression levels of ZNF217, NABC3, and PFDN4 in six cell lines and four
mammary tumors. Transcript levels are calculated as 2 " AN (Albertson et al. 2000) with GAPDH
as a reference gene and relative to the expression levels as measured in the human mammary
epithelial cells (HMECs). As a control, expression levels were measured with GUS as a reference
gene, which also showed nearly identical expression profiles for ZNF217 and NABC3 (not
shown). Cultured HMECs, cell lines MCF7, MDA436, BT474, 600MPE, T47D, and MKN7, and
primary tumors S1552, S1526, S0117, and S0055 were used as a source of template mRNA for this
experiment.

Conclusion

This is the first tumor amplicon to be completely sequenced
and biologically annotated. GC analysis provides
a comprehensive view of the genomic landscape,
including distribution of genes, repetitive elements,
duplications, cross-species homologies, and amplicon
structure, and suggests the possibility that NABC3 and
PFDN4 may play a role in cancer progression. These
results also suggest that repeated sequences and/or duplications
may be involved in aberration formation
and indicate specific genomic sequences that can be
interrogated to test this hypothesis. Integration of high-resolution
array CGH data with genomic sequence in
other recurrent amplicons will provide an important
test of the overall importance of repeat sequences and
duplicons in gene amplification in humans.

METHODS

Genome Sequence

A BAC and P1 contig was assembled between D20S902 and
D20S609 as described by Collins et al. (1998), and a minimum
tiling path of clones was selected for genomic sequencing (Fig. 1A).
Sequencing was performed at the Department of Energy's Joint
Genome Institute (http://www.jgi.doe.gov) and resulted in the
assembly of ~1.2 Mb. In addition, 865 kb of murine draft sequence
spanning the gene ZNF217 was generated. Genomic and comparative
sequence analyses were performed using Sequin
(ftp://ncbi.nlm.nih.gov/sequin/), enhanced with a suite of programs
for automation of data entry, PIP (http://bio.cse.psu.edu/pipmaker/),
and Genome Cryptographer.



Accession Numbers 

Human BAC and P1 clone accession
numbers are as follows: BAC109:
AC004499, P141: AC004505, P130:
AC004504, BAC185: AC005808,
BAC189: AC005914, P12:
AC006076, P128: AC004762,
BAC99: AC005220, BAC121:
AC004501, H119: AF312913, P139:
AF312912, H79/H117: AF312915,
H143: AF312914. Mouse BAC clone
accession numbers are as follows:
M1: AC023610, M10: AC073667,
and M12: AC073727. All accession numbers are from GenBank.

Sequence and Copy Number Annotation 

We have developed Genome Cryptographer (GC), which is
a suite of Perl programs to facilitate megabase-scale analysis of
genomic sequence (Fig. 5). This suite is built of separate mod-
ules that exchange information via intermediate text files.
Data in intermediate files are written in a consistent format:
sequence name, sequence length, window size, appropriate
data for a given window (the number of these "data" lines
equals the number of windows that are contained per se-
quence), and, optionally, after a blank line, annotation data.
Analysis of the sequence is done in the following stages.

Using script gc_plot.pl, we generate the plot of the GC-
content and number of CpG dinucleotides per AI. The CpG
dinucleotide density is weighted by adding 0.25 to the di-
nucleotide count for each CpG dinucleotide that is found
within 20 bp of another. This makes CpG islands more ap-
parent as peaks in CpG dinucleotide density plots. The script
also produces the graphic plot of the GC- and CpG-content
and, if available, can annotate the plot with features from the
output of the count_gene.pl script (making it easier to correlate
changes in GC and CpG content with sequence features). 
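
The weighting rule described above can be illustrated with a short sketch. This is not the gc_plot.pl code (which is written in Perl); it is a minimal Python rendering under stated assumptions: distances are measured between CpG start positions, "within 20 bp" is taken as a distance of at most 20, and the 0.25 bonus is applied once per CpG regardless of how many neighbors it has. The function name and the toy sequence are hypothetical.

```python
def weighted_cpg_counts(seq, window, bonus=0.25, max_gap=20):
    """Weighted CpG count per fixed-width analysis interval.

    Each CpG contributes 1.0, plus `bonus` if another CpG starts
    within `max_gap` bp of it (assumed reading of the rule).
    """
    seq = seq.upper()
    positions = []
    i = seq.find("CG")
    while i != -1:
        positions.append(i)
        i = seq.find("CG", i + 1)

    n_windows = (len(seq) + window - 1) // window
    counts = [0.0] * n_windows
    for k, pos in enumerate(positions):
        near_prev = k > 0 and pos - positions[k - 1] <= max_gap
        near_next = k + 1 < len(positions) and positions[k + 1] - pos <= max_gap
        weight = 1.0 + (bonus if (near_prev or near_next) else 0.0)
        counts[pos // window] += weight
    return counts


# Toy sequence: two clustered CpGs in the first 50 bp, one isolated CpG later.
toy_seq = "ACGTCGAT" + "A" * 52 + "CG" + "A" * 38
toy_counts = weighted_cpg_counts(toy_seq, window=50)
```

With this weighting, the two clustered CpGs score 1.25 each while the isolated one scores 1.0, so clustered regions stand out more strongly than a plain count would show.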

The sequence is analyzed for repeats using the publicly available
RepeatMasker program (Smit and Green,
http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl).
RepeatMasker output files are saved. Masked sequence is
used for searches of public and proprietary databases. Currently,
GC employs the NCBI version of blast
(ftp://ncbi.nlm.nih.gov/blast/). Sequence is compared to nonredundant,
HTGS, dbSTS, and dbEST divisions of GenBank. Sequence
similarity criteria are set to reduce the probability of
identifying ESTs from members of closely related gene families
(cutoff of expect score 10^-20).

Figure 5 Genome Cryptographer (GC) flowchart. The
names of programs are given above solid arrows lacking feathers.
Programs from the public domain are shown in italics. The final
graphics output is presented in pentagrams. Intermediate data
are shown in rectangles. Input of information into the graphics
module (graph.pl) is shown by feathered arrows. A module for
integrating expression and copy number array data is under development.
GC and GC tutorial are available at http://kinase.ucsf.edu/gc.

Optionally, masked sequence is searched against a data- 
base containing syntenic sequences of model organisms (in 
our case, mouse sequence from syntenic region of mouse 
chromosome 2). 

count_gene.pl and count_homol.pl are used to analyze out- 
put of the blast searches, creating a list of the number of 
relevant hits per AI. count_gene.pl also generates a first draft of 
sequence annotation data by capturing all the database hits
that exceed a user-selectable length threshold. If desired,
this annotation can be extended and updated manually by
the user. We capture the exact coordinates of regions of
identity of database hits used for annotation. This informa-
tion proved to be invaluable for analysis of the gene relation-
ships, because the alignment of cDNA sequence to genomic
sequence automatically yields intron-exon organization of 
the corresponding gene. 
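
The two ideas in this step, counting database hits per analysis interval and reading intron-exon structure from cDNA-to-genomic alignment coordinates, can be sketched as follows. This is an illustrative Python outline, not the count_gene.pl or count_homol.pl code. Hits are assumed to be half-open (start, end) genomic coordinates, a hit is counted in every interval it overlaps, and the function names and toy values are hypothetical.

```python
def hits_per_interval(hits, seq_len, window, min_len=0):
    """Count hits overlapping each fixed-width analysis interval.

    hits: iterable of (start, end) half-open genomic coordinates.
    Only hits longer than `min_len` are kept, mirroring a
    user-selectable length threshold.
    """
    n_windows = (seq_len + window - 1) // window
    counts = [0] * n_windows
    for start, end in hits:
        if end - start <= min_len:
            continue
        first = start // window
        last = min((end - 1) // window, n_windows - 1)
        for w in range(first, last + 1):
            counts[w] += 1
    return counts


def introns_from_exon_blocks(blocks):
    """Infer introns as the gaps between consecutive aligned cDNA blocks.

    blocks: genomic (start, end) coordinates of the aligned segments,
    each of which corresponds to an exon.
    """
    ordered = sorted(blocks)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns


# Toy coordinates (hypothetical, not from the 20q13.2 sequence).
toy_hits = [(100, 400), (900, 1200), (1500, 1520)]
toy_counts = hits_per_interval(toy_hits, seq_len=2000, window=1000, min_len=50)
toy_introns = introns_from_exon_blocks([(900, 1200), (100, 400)])
```

Keeping the exact block coordinates, rather than only per-interval counts, is what allows the intron boundaries to be recovered afterward.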

Finally, graph.pl is used to gather information produced
by gc_plot.pl (CpG distribution data), RepeatMasker (repeat
distribution data), count_gene.pl (annotation and distribution
of database hits), and count_homol.pl (distribution of con-
served regions) and produce a graphical summary. Currently
we are working on the extension of graph.pl capabilities (to
make output interactive and to add capability to include gene
expression and copy number data from array-based experi-
ments). The first version of the Genome Cryptographer soft-
ware is accessible at http://kinase.ucsf.edu/gc.

FISH Mapping 

FISH mapping was performed as described in Kallioniemi et 
al. (1992) and Stokke et al. (1995). Briefly, BAC DNA was 
extracted from overnight cultures and labeled with digoxi-
genin-11-dUTP by nick translation. Hybridization to meta-
phase chromosomes was carried out in the presence of human
Cot-1 DNA overnight, and hybridized signal was detected using
anti-digoxigenin conjugated with FITC. Chromosomes were 
counterstained with DAPI to localize the hybridization signal. 

Microinjection and Fluorescence Microscopy 

ZNF217-EGFP (Clontech) cellular targeting was monitored af- 
ter microinjection of 10 ng/mL recombinant plasmid into 
HeLa cells grown on glass coverslips in 10% fetal calf serum in 
Dulbecco's modified Eagle's medium as previously described 
(Tominaga et al. 2000). Two hours after microinjection, cells 
were fixed and stained with Hoechst 33258 to visualize DNA 
and fluorescent images were captured with a SPOT CCD cam- 
era mounted on a Leica microscope equipped with a 100X 
oil-immersion objective. 

Quantitative PCR 

Quantitative PCR (Taqman) was performed as described pre- 
viously (Albertson et al. 2000). PCR primer and probe se- 
quences are as follows: 

ZNF217: Forward TTTTTCCGTTCAAATTATTACCTCAA,
Reverse GCAGCATATTCACAAAATTCACATT, and the TaqMan
probe: FAM-CATCTCAGAACGCATACAGGTGAAAAACCATAC-TAMRA.

NABC3: Forward CTACGCTGTAGGACACACAGTGG,
Reverse TAAATGGCGGTTGCAGTGGT, and the TaqMan
probe: FAM-CAATAATACAGGACCCCCAAACTGGCCA-TAMRA.

PFDN4: Forward TTGGTGATGTCTTCATTAGCCATT,
Reverse TTCCACTCTGGATTCTAAGGCG, and the TaqMan
probe: FAM-AAGAAACGCAAGAAATGTTAGAAGAAGCAAAGAAAAAT.
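
Expression in Figure 4 is reported relative to a reference gene (GAPDH) and to the level measured in HMECs. The sketch below illustrates that kind of normalization using the widely used comparative Ct (2^-ΔΔCt) calculation. This is an assumption made for illustration: the calculation actually used in this study is the one described by Albertson et al. (2000). The sketch also assumes roughly 100% amplification efficiency, and the function name and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_ref, ct_target_cal, ct_ref_cal):
    """Relative expression by the comparative Ct method (assumed here).

    ct_target, ct_ref: threshold cycles for the gene of interest and
        the reference gene in the test sample.
    ct_target_cal, ct_ref_cal: the same two values in the calibrator
        sample (for example, normal cultured cells).
    """
    delta_sample = ct_target - ct_ref
    delta_calibrator = ct_target_cal - ct_ref_cal
    delta_delta = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses threshold 3 cycles earlier
# in the test sample than in the calibrator at equal reference levels.
fold = relative_expression(ct_target=22.0, ct_ref=18.0,
                           ct_target_cal=25.0, ct_ref_cal=18.0)
```

Repeating the calculation with a second reference gene, as was done here with GUS, is a simple check that the result does not depend on the choice of reference.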